fix(config): passing gradient_checkpoint_kwargs #1412
Conversation
In my experience, see my comment in #1167.
In that code block above, it seems that the default is already applied during config validation. Then, the trainer builder applies its own default on top of that. Either way, I think one of the defaults should be removed to prevent future confusion. Edit: your linked PR sets it to true, despite the comment saying it's false.
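As a rough sketch of the single source of truth this comment is asking for (the function name here is hypothetical; the config/__init__.py lines linked below are the real implementation):

```python
def normalize_gradient_checkpointing(cfg: dict) -> dict:
    # Hypothetical sketch: apply the use_reentrant default in exactly
    # one place, so trainer_builder.py no longer needs its own
    # (potentially conflicting) fallback value.
    if cfg.get("gradient_checkpointing") and cfg.get("gradient_checkpointing_kwargs") is None:
        cfg["gradient_checkpointing_kwargs"] = {"use_reentrant": True}
    return cfg
```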
Thanks for digging into this. Good to go!
* fix(config): change default use_reentrant to true
* Update trainer_builder.py
* fix: make sure to pass kwargs to enable checkpoint
* chore: lint
According to huggingface/transformers#28339, setting it to `False` increases VRAM; my quick testing shows a ~1 GB increase at the lowest settings. Furthermore, the default in both transformers and torch is going to become `True` (see huggingface/transformers#29638 (comment)). Finally, this PR removes the `False` default in trainer_builder.py to clean up handling of old configs, since the kwarg is now set in https://github.com/OpenAccess-AI-Collective/axolotl/blob/a914cb37dc455a3fd0368e3a0898867f25b3a6c9/src/axolotl/utils/config/__init__.py#L170-L176
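For illustration, a minimal sketch of how the kwarg reaches torch: with transformers >= 4.35, `TrainingArguments` accepts `gradient_checkpointing_kwargs`, and the `Trainer` forwards it to `model.gradient_checkpointing_enable()`. The output directory is a placeholder.

```python
from transformers import TrainingArguments

# Sketch: forwarding use_reentrant through the HF Trainer stack.
# gradient_checkpointing_kwargs is passed to
# model.gradient_checkpointing_enable(), which hands it down to
# torch.utils.checkpoint.checkpoint(..., use_reentrant=True).
args = TrainingArguments(
    output_dir="out",  # placeholder path
    gradient_checkpointing=True,
    gradient_checkpointing_kwargs={"use_reentrant": True},
)
```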